Named Entity Recognition

Overview

Named Entity Recognition (NER) is a basic information extraction task in which words (or phrases) are classified into pre-defined entity groups (or marked as not of interest). Words and phrases in an entity group share common characteristics and can often be identified by the shape of the word or by the context in which it appears in a sentence. Examples of entity groups are: names, numbers, locations, currency, dates, company names, etc.

Example sentence:

John is planning a visit to London on October
|                           |         |
Name                        City      Date

In this example, name, city and date entities are identified.

Datasets

In this example we used publicly available NER datasets that appear in common research papers. The data must be split into train and test sets, preprocessed, tokenized, and tagged with a finite set of entity labels in BIO format (each token is marked as Beginning, Inside or Outside of an entity).

The dataset files must be processed into tabular format where each entry is of the following format:

<token> <tag_1> ... <tag_n>

In this format, sentences are separated by an empty line. Each line contains a single token followed by its tags, separated by whitespace (or any whitespace divider).
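
For illustration, here is a minimal example of one tagged sentence in this format (the tags below are assumptions; the actual tag set depends on the dataset):

John B-PER
is O
planning O
a O
visit O
to O
London B-LOC
on O
October B-DATE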

Data loader

Data is loaded into the model using the SequentialTaggingDataset data loader, which works with the prepared train and test data sets described above.

The data loader returns two NumPy matrices:

  1. a sparse word representation of the sentence words
  2. a sparse character-level representation of the sentence words

Either representation, or both, can be used when training models.
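
Below is a minimal usage sketch. The constructor arguments shown are assumptions modeled on the training options listed later in this document and may differ between versions:

from nlp_architect.data.sequential_tagging import SequentialTaggingDataset

# Load the prepared train/test files (format described above).
# max_sentence_length, max_word_length and tag_field_no are assumed to
# mirror the --sentence_length, --word_length and --tag_num options.
dataset = SequentialTaggingDataset('train.txt', 'test.txt',
                                   max_sentence_length=30,
                                   max_word_length=20,
                                   tag_field_no=2)

# Each split is assumed to yield word ids, character ids and tag labels
x_train, x_char_train, y_train = dataset.train
x_test, x_char_test, y_test = dataset.test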

Model

The NER model is based on the bidirectional LSTM with Conditional Random Field (CRF) sequence classifier published by Lample et al.

The model has 2 inputs:

  1. sentence words - converted into dense word embeddings or loaded from an external pre-trained word embedding model.
  2. character embeddings - trained using the characters of the sentence words.

A high-level overview of the model is provided in the figure below:

[Figure: high-level architecture of the BiLSTM-CRF NER model]

Feature generation

NER words or phrases can sometimes be easily identified by the shape of the words, by pre-built lexicons, by part-of-speech analysis, or by rules combining patterns of the above features. In many other cases those features are unknown or non-existent, and the context in which the words appear provides the indication of whether a word or a phrase is an entity.

With the help of RNN topologies we can use LSTMs to extract character-based features of words. In this model a bidirectional LSTM runs over the characters making up each word, and its final states are taken as the character-level representation of the word (a similar approach uses convolutions to extract n-gram features from the characters instead). More information on character embeddings can be found in the paper by Lample et al.
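
To make this concrete, here is a minimal sketch of such a character feature extractor, assuming a Keras-style API (tf.keras here) and placeholder sizes; it is not the library's exact implementation:

import tensorflow as tf
from tensorflow.keras import layers

max_word_length = 20      # assumption: maximum characters per word
num_chars = 128           # assumption: size of the character vocabulary
char_embedding_dims = 25  # assumption: character embedding size
char_lstm_dims = 25       # assumption: per-direction LSTM size

# Input: a single word as a sequence of character ids
char_ids = layers.Input(shape=(max_word_length,), dtype='int32')
char_emb = layers.Embedding(num_chars, char_embedding_dims)(char_ids)
# The final states of a BiLSTM over the characters represent the word
char_features = layers.Bidirectional(layers.LSTM(char_lstm_dims))(char_emb)
char_model = tf.keras.Model(char_ids, char_features)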

Prediction layer

The main tagger model consists of bidirectional LSTM layers. The input to the LSTM layers is a concatenation of the word embedding vector and the character embedding vector (provided by the character embedding network).

Finally, the output of the LSTM layers is fed into a fully-connected layer (applied to each token) and then into a Conditional Random Field classifier. Using a CRF has been empirically shown to produce more accurate models than single-token prediction layers (such as a softmax layer), since it scores entire tag sequences and can model dependencies between adjacent tags.
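
The sketch below shows how these pieces fit together, under the same assumptions as the previous sketch (placeholder sizes, tf.keras API). The final TimeDistributed dense layer produces per-token emission scores; a CRF layer (available in add-on packages) would be stacked on top of it instead of a per-token softmax:

import tensorflow as tf
from tensorflow.keras import layers

max_sentence_length = 30   # assumption
vocab_size = 20000         # assumption: word vocabulary size
word_embedding_dims = 100  # assumption
char_feature_dims = 50     # assumption: output size of the character BiLSTM
tagger_lstm_dims = 100     # assumption: per-direction LSTM size
num_tags = 9               # assumption: size of the BIO tag set

words = layers.Input(shape=(max_sentence_length,), dtype='int32')
char_feats = layers.Input(shape=(max_sentence_length, char_feature_dims))

word_emb = layers.Embedding(vocab_size, word_embedding_dims)(words)
# Concatenate word embeddings with per-token character features
features = layers.Concatenate()([word_emb, char_feats])
hidden = layers.Bidirectional(
    layers.LSTM(tagger_lstm_dims, return_sequences=True))(features)
# Per-token fully-connected layer producing tag emission scores;
# a CRF classifier is applied on top of these scores
emissions = layers.TimeDistributed(layers.Dense(num_tags))(hidden)
model = tf.keras.Model([words, char_feats], emissions)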

Running Modalities

Training

Quick train

Train a model with default parameters given input data files:

python examples/ner/train.py --train_file train.txt --test_file test.txt

Full training parameters

All customizable parameters can be obtained by running:

python examples/ner/train.py -h

-h, --help show this help message and exit
-b B Batch size
-e E Number of epochs
--train_file TRAIN_FILE
 Train file (sequential tagging dataset format)
--test_file TEST_FILE
 Test file (sequential tagging dataset format)
--tag_num TAG_NUM
 Entity labels column (tab) number in train/test files
--sentence_length SENTENCE_LENGTH
 Max sentence length
--word_length WORD_LENGTH
 Max word length in characters
--word_embedding_dims WORD_EMBEDDING_DIMS
 Word features embedding dimension size
--character_embedding_dims CHARACTER_EMBEDDING_DIMS
 Character features embedding dimension size
--char_features_lstm_dims CHAR_FEATURES_LSTM_DIMS
 Character feature extractor LSTM dimension size
--entity_tagger_lstm_dims ENTITY_TAGGER_LSTM_DIMS
 Entity tagger LSTM dimension size
--dropout DROPOUT
 Dropout rate
--embedding_model EMBEDDING_MODEL
 Path to external word embedding model file
--model_path MODEL_PATH
 Path for saving model weights
--model_info_path MODEL_INFO_PATH
 Path for saving model topology
--use_cudnn use CUDNN based LSTM cells

The model weights and topology are saved automatically after training is complete (file paths can be set using the options above).

Interactive mode

The provided interactive.py script runs a pre-trained model in interactive mode, reading input directly from stdin.

Run python examples/ner/interactive.py -h for a full list of options:

--model_path MODEL_PATH
 Path of model weights
--model_info_path MODEL_INFO_PATH
 Path of model topology

Quick example:

python examples/ner/interactive.py --model_path model.h5 --model_info_path model_info.dat

References

  1. Neural Architectures for Named Entity Recognition - Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. 2016